Abstract. The advancement of mobile technologies and the proliferation of
map-based applications have enabled a user to access a wide
variety of services that range from information queries to navigation systems.
Due to the popularity of map-based applications among users, the service
provider often has to answer a large number of simultaneous queries. Thus,
processing queries efficiently on spatial networks (i.e., road networks) has
become an important research area in recent years. In this paper, we focus on
path queries that find the shortest path between a source and a destination of
the user. In particular, we address the problem of finding the shortest paths
for a large number of simultaneous path queries in road networks. Traditional
systems that consider one query at a time are not suitable for many applications
due to high computational and service costs. These systems cannot guarantee the
required response time in high load conditions. We propose an efficient group
based approach that provides a practical solution with reduced cost. The key
concept of our approach is to group queries that share a common travel path and
then compute the shortest path for the group. Experimental results show that our
approach is on average ten times faster than the traditional approach, at the
cost of sacrificing accuracy by 0.5% in the worst case, which is acceptable for
most users.
1 Introduction



With the proliferation of GPS-enabled mobile technologies, users access a wide
variety of location-based services (LBS) from different service providers. These
LBS range from simple information queries, such as finding the nearest
restaurant, to navigational queries, such as finding the shortest path to a
destination. In this paper, we focus on path queries that find the shortest path
between a source and a destination of the user. In particular, we address the
problem of finding the shortest paths for a large number of simultaneous path
queries in road networks. Traditional systems that consider one query at a time
are not suitable for many applications, as these systems cannot guarantee a
cost-effective and real-time response to the user in high load conditions
[18,29]. We propose an efficient group based approach that provides a practical
solution for path queries with reduced computational cost.

In a road network, users are often interested in a service (e.g., finding the 
restaurant or the path to the destination) that can be reached in minimum 
travel time. Since travel time on a road segment is highly dynamic and depends 
on various real-time traffic conditions [7] , it is not possible to accurately compute 
the travel time based on the network distance. Thus, to answer such user queries, 
an LBS provider needs to gather real time traffic conditions of the underlying 
road networks. However, it may not be possible for every LBS provider to have 
their own monitoring infrastructure for real traffic updates. Therefore, to process 
queries on road networks, LBS providers subscribe to map based services such as 
Google Maps [M], MapQuest [H], Yahoo Maps [28], and Microsoft Bing Maps [4] 
for traffic updates. 

Due to the huge client bases and popularity of map-based services, the LBS
server is often required to respond to a large number of simultaneous user
queries. Thus, efficient processing of a large number of queries in road
networks has become an important research area in recent years. In particular,
when an LBS server needs to call the map based services for every user request,
it becomes a major bottleneck in providing a cost-effective and real-time
response to the
user [2^. The underlying reason of the problem is as follows. First, map based 
web services charge on a per request basis (e.g., in Google Maps [H], an
evaluation user can submit 2,500 requests per day and a licensed business user
can submit 100,000 requests per day [1]), and thus an LBS provider needs to pay
more for more user requests. Second, each map based web service call incurs a
large delay in response, e.g., a web service call to fetch the travel time from
the Microsoft MapPoint web service to a database engine takes 502 ms [18]. In
addition to the above problems, well known approaches for the shortest path
computation (e.g., Dijkstra's algorithm [31] and the A* algorithm [22]) require
expensive graph traversal operations, and thus incur a large computational
overhead, especially for a large number of queries. To alleviate all these
problems, we propose a grouping approach for an LBS server that efficiently
processes a large number of simultaneous path queries in road networks.

The key concept of shared execution of a group of path queries comes from the
path coherence property of road networks [25]. Path coherence is the observation
that the shortest paths from nearby sources to nearby destinations share a large
common path. That is, two spatially close source vertices $s_1, s_2$ share large
common road segments in their shortest paths to two spatially close destination
vertices $d_1, d_2$ that are far from $s_1, s_2$. For example, it has been
observed that the 30,000 shortest paths between two subsets of sources and
destinations in a road network of Silver Spring, MD pass through a single common
vertex [25].

Based on the above observation, to find a subgroup from a large number of path
queries, we first find a cluster of source-destination pairs based on
similarities of the Q-lines, where a Q-line is defined as the straight line
connecting the source and destination of a given path query. We introduce a
distance function to measure the similarities among Q-lines, and areas of
influence to prune the search space while clustering similar Q-lines. These
concepts form the basis of our clustering algorithm, which returns a set of
clusters of source-destination regions. Each cluster is essentially a group of
path queries that have a high probability of sharing a common travel path in
their answers. After computing the clusters, for each cluster, we compute the
shortest path from a source region to its corresponding destination region by
considering only the outgoing edges of the source region and the incoming edges
of the destination region. Finally, each individual path query is evaluated by
concatenating three shortest paths: the shortest path from the source location
to the starting point of the group shortest path, the group's shortest path, and
the shortest path from the end point of the group shortest path to the
destination location.

Our group based heuristic for answering the shortest paths for a large number of
simultaneous path queries significantly reduces the computational overhead and
ensures a real-time response to the users. Although the group based approach
does not guarantee the optimal shortest path for all queries, the deviation from
the optimal paths is found to be negligible. An extensive experimental study on
a real road network shows that our group based heuristic approach is on average
ten times faster than the straightforward approach that evaluates each path
query independently, at the cost of sacrificing accuracy by 0.5%, which is
acceptable for most users.

In summary, the contributions of this paper are as follows: 

— We formulate the problem of group based path queries in road networks. 

— We develop an efficient clustering technique to group path queries based on 
similarities of Q-lines, which forms the basis of our efficient solution to
process a large number of simultaneous path queries in a road network.

— We conduct an extensive experimental study to show the efficiency and the 
effectiveness of our approach. 

2 Related Works 

To handle a large number of queries in modern database systems, shared execution
of queries has recently received a lot of attention [2,8,27,30]. The core idea
of all these approaches is to group similar queries (i.e., queries that share
some common execution path) and then execute the group as a single query in the
system. These approaches are found to be effective for many applications in
handling high load conditions, whereas traditional systems that consider one
query at a time fail to deliver the required performance for such applications.
In this paper, we propose a shared execution approach for path queries on road
networks, which is the first attempt of this kind.

The problem of finding the shortest path from a source to a destination on a
graph (or spatial/road network) has been extensively studied in the literature
(e.g., [12,22,31]). Dijkstra's algorithm [31] is the most well-known approach
for computing single source shortest paths with non-negative edge costs.
Dijkstra's algorithm [31] incrementally expands the search space, starting from
the source, along network edges until the destination is reached. Hence, it has
to visit nodes and edges that are far away from the actual destination. To
improve its efficiency, a few variations of Dijkstra's algorithm have been
developed (e.g., [12]). An alternative school of thought employs hill climbing
algorithms, such as A* search [22] and RBFS [24], which use heuristics, e.g.,
Euclidean distances, to prune the search space. However, both Dijkstra's and
hill climbing algorithms incur expensive graph traversal operations and require
complete recomputation on every update. These approaches [12,22,31] assume that
the road conditions (e.g., traffic jams) remain static.

To address the dynamic load conditions of road networks, several approaches have
been proposed [17,19,23]. Dynamic SWSF-FP [23] does not regenerate a new path
for every update; rather, it only reconstructs the area affected by the update
or by changes in the environment. However, the time complexity of this algorithm
can be very high, as it is proportional to the number of affected nodes. Dynamic
variants of A* search, such as dynamic anytime A* [19] and lifelong planning A*
[17], have also been proposed to handle dynamic load conditions. The main idea
behind these approaches is to keep registered routes to avoid complete
recomputation. To compute the shortest path for dynamic edge costs, an algorithm
called dynamic single source shortest path (DSSS) is proposed in [53]. This
approach requires pre-computation of shortest paths for each source. King's
approach for dynamic all pairs shortest paths (APSP) [16] requires
pre-computation for each pair of nodes. The former requires a large amount of
memory, whereas the latter has the limitation that all edge weights need to be
integers bounded by a small constant.

To accommodate time dependent road conditions in the shortest path calculation,
Gonzalez et al. [13] use a traffic mining approach to determine the traffic
patterns from historical data for different routes at different times.
Similarly, Kanoulas et al. [15] use speed patterns of previous days to compute
the most efficient path for a certain time.

Some techniques [11,10,3] rely on graph preprocessing (e.g., landmarks) under
the assumption of static conditions to accelerate query response times. Reach
based routing [11] improves query response times by adding shortcut edges that
reduce nodes' reaches during preprocessing. Landmark indexing [10] and transit
routing [3] boost run time query performance by using precomputed distances
between certain set of landmarks chosen according to the algorithms. An adap- 
tation of landmark based routing in dynamic scenarios JjJ yields improved query 
times but requires a link's cost not to drop below its initial value. A dynamic 
variant of highway node routing gives fast response times but can handle a 
very small number of edge weight changes. 

Recently, an approach for continuous route planning queries over a road network
[20] has been proposed that overcomes the limitations of the precomputation
based algorithms. This approach proposes two classes of approximate techniques,
K-paths and proximity measures, to speed up processing of the set of designated
routes specified by continuous route planning queries in the face of incoming
traffic delay updates. Rather than recomputing on every update, this technique
sends the user a new route only when delays change significantly. However, this
system needs a large external hardware setup. For example, 30 taxi cabs were
deployed in a city to collect raw GPS data. The raw data is then preprocessed to
identify the traversed road segments and to estimate delays.

Although some of the above approaches consider dynamic road conditions, all of
these existing approaches treat each user query individually, and identifying
the similarities among users' queries is beyond their scope. In this paper, we
propose an algorithm that considers the group behavior of the queries and
calculates the shortest paths for a large number of queries. The
idea is to reduce the computation cost by grouping queries. There are several 
clustering algorithms 0, which use properties such as density, similarity, etc. to 
cluster nodes, lines and trajectories. However, the use of clustering techniques 
in shortest path calculation has not been addressed so far. 

3 Group Based Path Queries (GBPQ) 

We propose an efficient approach to compute group based path queries (GBPQ) 
on road networks. A path query takes a source and a destination as inputs and 
returns a sequence of road segments that minimizes the total travel cost (e.g., 
travel time) from the source to the destination. Given a set of n path queries,
a group based path query (GBPQ) approach clusters the queries into groups and
evaluates each group of path queries collectively.

3.1 Intuition 

The basic idea of our approach is based on a key observation about road
networks, i.e., path coherence. Path coherence is the property that the shortest
paths from nearby sources to nearby destinations share a large common path. That
is, two nearby source vertices $s_1, s_2$ will have large road segments in
common in their shortest paths to two nearby destination vertices $d_1, d_2$.
Thus, if we can group these source-destination pairs together, we can reduce the
computational overhead significantly through the shared execution of these
queries. To group these source-destination pairs, we use query line (Q-line)
similarities. A Q-line is a straight line connecting a source (e.g., $s_1$) and
a destination (e.g., $d_1$). Based on the path coherence property, it is highly
likely that queries that have similar Q-lines will have a large portion of their
travel path in common. Hence, we propose a group based approach that groups
source and destination pairs based on their Q-line similarities and executes the
group query on the road network.




3.2 Solution Overview 



Our approach GBPQ for processing shortest path queries groups source-destination
pairs into different clusters based on their Q-lines, i.e., the straight lines
connecting the source and destination of each pair. A source region and a
destination region are obtained by combining the sources and the destinations,
respectively, of the same group. These regions are treated as the source and
destination of all the points of their corresponding clusters. A region acts
like a virtual super-node having region exit paths as its edges. Next, the
weighted shortest path from the source region to the destination region is
computed, where the weights refer to the travel cost (travel time or travel
length) of the path. Finally, the shortest path for every source-destination
pair that belongs to a cluster is computed by concatenating three path
fragments: (i) the shortest path from the source point to the starting point of
the source-destination region shortest path, (ii) the source region to
destination region shortest path of the cluster, and (iii) the shortest path
from the end point of the source-destination region shortest path to the
destination point.




Fig. 1. An example of finding shortest paths in a group based framework.

Figure 1 illustrates the procedure with eight path queries: (a) obtaining
Q-lines from the initial source and destination points of the path queries, (b)
clustering source and destination points and creating region pairs
$(R_1^s, R_1^d)$ and $(R_2^s, R_2^d)$, (c) finding the shortest path between two
regions, and (d) adding up internal paths within a region to obtain the complete
path.

Note that since we consider only the best path between a source region and its
corresponding destination region for a group, this path may not be the best path
for every source-destination point pair that belongs to this group. Thus, an
individual query in GBPQ may result in a slightly longer path than the optimal
shortest path. Our goal is to keep the deviation from the optimal shortest path
as low as possible. Our extensive evaluation shows that the deviation of the
path returned by GBPQ from the optimal path is only about 0.5% in the average
case, which is within the acceptable limit of the users.



4 Algorithm 



In this section, we present our algorithm for group-based path queries (GBPQ).
The input to the algorithm is a set of n path queries,
$SD = \{(s_1, d_1), (s_2, d_2), \ldots, (s_n, d_n)\}$, where $(s_i, d_i)$
represents a path query from a source $s_i$ to a destination $d_i$ for
$1 \le i \le n$, and the output of the algorithm is a set of approximate shortest
paths, $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ represents the approximate
shortest path for the path query $(s_i, d_i)$. Though there exist algorithms
[31,22] that find the optimal answers for path queries, applying those
algorithms independently, one query at a time, incurs a high computational
overhead, which is a barrier to answering a large number of simultaneous path
queries, especially in high load conditions. Thus, we propose a shared execution
strategy that groups similar queries using some key features of road networks.
The algorithm first finds a common shortest path for each group of path queries
and then computes the approximate shortest path for each individual path query
$(s_i, d_i)$ based on the common shortest path of the group. Our approach
sacrifices the accuracy of the query answers slightly, i.e., it computes a
slightly longer path than the optimal one, in return for significant savings in
computation time. Table 1 summarizes the symbols used in this section.

Table 1. Symbols used in our approach.

Symbol       Description
$SD$         A set of n path queries $\{(s_1, d_1), (s_2, d_2), \ldots, (s_n, d_n)\}$
$L$          A set of Q-lines $\{l_1, l_2, \ldots, l_n\}$
$P$          A set of approximate shortest paths $\{p_1, p_2, \ldots, p_n\}$
$C$          A set of clusters $\{c_1, c_2, \ldots, c_m\}$
$R$          A set of region pairs $\{(R_1^s, R_1^d), (R_2^s, R_2^d), \ldots, (R_m^s, R_m^d)\}$
$SP$         A set of weighted shortest paths $\{sp_1, sp_2, \ldots, sp_m\}$
$\lambda$    Half length of the side of areas of influence
$\psi$       Distance threshold to group queries while forming clusters
$\mu$        Minimum required number of queries to form a cluster


Algorithm 1, Evaluate_GBPQ, gives the pseudo code for processing GBPQ. The
algorithm finds the set of shortest paths, $P$, in three steps: (i) Q-line
formation (Lines 1.1-1.2), (ii) Q-line clustering and region formation (Lines
1.3-1.4), and (iii) path calculation (Lines 1.5-1.9). We discuss the details of
the three steps in the following sections.

4.1 Q-line Formation 

We define a Q-line, $l_i$, as the straight line connecting the source $s_i$ and
the destination $d_i$ of a path query. The concept of a Q-line is used to predict
the similarity of path queries whose answers share common paths. The algorithm
computes the set of Q-lines, $L = \{l_1, l_2, \ldots, l_n\}$, in Line 1.2, which
is used for clustering the path queries in the second step of the algorithm.
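As a minimal sketch of this step (not the authors' implementation), Q-line
formation could be written in Python as follows. The names `QLine`,
`form_q_lines` and `q_line_length`, and the use of planar (x, y) coordinates for
sources and destinations, are assumptions made for illustration.

```python
import math
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]  # assumed planar (x, y) coordinates


class QLine(NamedTuple):
    """Straight line from the source to the destination of one path query."""
    source: Point
    dest: Point


def form_q_lines(queries: List[Tuple[Point, Point]]) -> List[QLine]:
    # One Q-line per (source, destination) pair, in input order.
    return [QLine(s, d) for s, d in queries]


def q_line_length(q: QLine) -> float:
    # Euclidean length of the Q-line (the straight-line query distance).
    return math.hypot(q.dest[0] - q.source[0], q.dest[1] - q.source[1])
```

For example, `form_q_lines([((0, 0), (3, 4))])` yields a single Q-line whose
straight-line length is 5.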

4.2 Query Clustering and Region Formation 

Algorithm 1 clusters the path queries based on the similarities of Q-lines and
then computes the source and destination region pair for each cluster. A source
and destination region pair consists of a source region and a destination
region, where the source region (destination region) of a cluster is a minimum
bounding rectangle (MBR) containing the locations of the sources (destinations)
of all path queries in the cluster. Note that we use MBRs to represent source
and destination regions because the tighter the regions, the smaller the
deviation of the computed path from the optimal one.

Algorithm 1: Evaluate_GBPQ(SD)

Input : SD = {(s_1, d_1), (s_2, d_2), ..., (s_n, d_n)}
Output: P = {p_1, p_2, ..., p_n}

     /* Q-line formation */
1.1  for each (s_i, d_i) ∈ SD do
1.2      l_i ← getStraightLine(s_i, d_i)
     /* Query clustering and region formation */
1.3  C ← Cluster_Queries(l_1, l_2, ..., l_n)
1.4  Compute R = {(R_1^s, R_1^d), (R_2^s, R_2^d), ..., (R_m^s, R_m^d)}
     /* Path calculation */
1.5  for each (R_j^s, R_j^d) ∈ R do
1.6      sp_j ← weightedShortestPath(R_j^s, R_j^d)
1.7  for each (s_i, d_i) ∈ SD do
1.8      Find j such that s_i ∈ R_j^s and d_i ∈ R_j^d
1.9      p_i ← Construct_Path(s_i, d_i, sp_j)

Algorithm 1 finds the set of clusters $C = \{c_1, c_2, \ldots, c_m\}$ using the
function Cluster_Queries (Line 1.3), where $m \le n$. Algorithm 2 shows the steps
of Cluster_Queries, which is discussed in detail later in Section 4.2. To
measure the similarity among Q-lines, we define two metrics: (i) the distance
function, and (ii) the areas of influence, both of which are used in
Cluster_Queries. Therefore, we first explain the distance function and the areas
of influence.

After clustering the queries, Algorithm 1 (Line 1.4) computes the set of source
and destination region pairs,
$R = \{(R_1^s, R_1^d), (R_2^s, R_2^d), \ldots, (R_m^s, R_m^d)\}$, where $R_j^s$
and $R_j^d$ represent the source region and the destination region, respectively,
of cluster $c_j$ for $1 \le j \le m$.
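The region formation step can be illustrated with a short Python sketch. It
assumes planar (x, y) coordinates and represents an MBR as a tuple
`(min_x, min_y, max_x, max_y)`; the function names `mbr` and `region_pair` are
hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr(points: List[Point]) -> MBR:
    # Tightest axis-aligned rectangle that contains all given points.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def region_pair(cluster: List[Tuple[Point, Point]]) -> Tuple[MBR, MBR]:
    # Source region from all sources, destination region from all destinations.
    sources = [s for s, _ in cluster]
    dests = [d for _, d in cluster]
    return mbr(sources), mbr(dests)
```

Applying `region_pair` to every cluster would give the set of region pairs used
in the path calculation step.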

Distance function: A distance function is used to measure the similarities
among the Q-lines of user queries. We use three distance measures and combine
them to measure the distance between any two Q-lines. The three distance
measures are: (i) parallel distance $d_{\parallel}$, (ii) perpendicular distance
$d_{\perp}$, and (iii) angular distance $d_{\theta}$. We calculate the distance
between two Q-lines using the following formula:

$$distance = w_{\perp} d_{\perp} + w_{\parallel} d_{\parallel} + w_{\theta} d_{\theta} \qquad (1)$$

The weights $w_{\perp}$, $w_{\parallel}$ and $w_{\theta}$ control the effective
contribution of the three components to the overall distance. For example, a
larger value of $w_{\parallel}$ reduces the length difference between a Q-line
and its projection on the other Q-line. Similarly, a larger value of $w_{\perp}$
keeps the endpoints of two queries closer to each other. In our experiments, we
set all these weights to unity, so that all three components of the distance
function have an equal effect on the overall distance. We group user queries
based on the distance function. Two path queries can be grouped together if
their distance is less than a threshold value $\psi$. The formal definitions and
impact of the parallel, perpendicular and angular distances [9] are discussed
below. The symbols used in the definitions are shown in Figure 2.

Parallel Distance: Let $s_j$ and $d_j$ be the two endpoints of the Q-line
$l_j$. If the projection of $s_j d_j$ over $l_i$ is $p_s p_d$, then the parallel
distance is defined as the maximum of the Euclidean distances from $s_i$ to
$p_s$ and from $d_i$ to $p_d$, as shown in Equation 2:

$$d_{\parallel} = \mathrm{MAX}(\lVert s_i p_s \rVert, \lVert p_d d_i \rVert) \qquad (2)$$

Perpendicular Distance: Let $l_{\perp 1}$ and $l_{\perp 2}$ be the distance
components of two Q-lines $l_i$ and $l_j$ as shown in Figure 2. Then the
perpendicular distance of these two Q-lines is defined as the second order
Lehmer mean [5] of $l_{\perp 1}$ and $l_{\perp 2}$, as shown in the following
equation:

$$d_{\perp} = \frac{l_{\perp 1}^2 + l_{\perp 2}^2}{l_{\perp 1} + l_{\perp 2}} \qquad (3)$$

Angular Distance: Let $\theta$ be the smaller intersecting angle between
Q-lines $l_i$ and $l_j$. Then their angular distance is defined as the product
of the length of the longer Q-line and $\sin\theta$, as given by the following
equation:

$$d_{\theta} = \mathrm{MAX}\{\lVert l_i \rVert, \lVert l_j \rVert\} \times \sin\theta \qquad (4)$$

In summary, a smaller value of $d_{\parallel}$ ensures that the difference
between the length of one Q-line and the length of the projection of the other
Q-line over the first one is smaller. On the other hand, as the value of
$d_{\perp}$ increases, the endpoints of two Q-lines move farther away from each
other. So when both of these distances, $d_{\parallel}$ and $d_{\perp}$,
approach zero, the query locations get close to each other and the query
distances, i.e., the straight line distances between a source and a destination,
converge towards equality. Finally, the angular distance $d_{\theta}$ checks the
parallelism between two Q-lines. When $d_{\theta} = 0$, the two Q-lines are
parallel to each other. Thus, for two exactly similar Q-lines, the overall
distance value is zero.
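The following Python sketch shows one way the combined distance of Equations
1 to 4 might be computed. Since Figure 2 is not reproduced here, it assumes that
$l_{\perp 1}$ and $l_{\perp 2}$ are the perpendicular offsets of the endpoints of
$l_j$ from the infinite line through $l_i$, that coordinates are planar, and
that both Q-lines have non-zero length. The function name `q_line_distance` and
its parameters are illustrative only.

```python
import math


def q_line_distance(li, lj, w_perp=1.0, w_par=1.0, w_theta=1.0):
    """Weighted sum of perpendicular, parallel and angular distances.

    li, lj: Q-lines given as ((sx, sy), (dx, dy)).
    """
    (six, siy), (dix, diy) = li
    (sjx, sjy), (djx, djy) = lj
    ux, uy = dix - six, diy - siy          # direction of l_i
    vx, vy = djx - sjx, djy - sjy          # direction of l_j
    len_i, len_j = math.hypot(ux, uy), math.hypot(vx, vy)
    if len_i == 0 or len_j == 0:
        raise ValueError("Q-lines must have non-zero length")

    def project(px, py):
        # Orthogonal projection of a point onto the line through l_i.
        t = ((px - six) * ux + (py - siy) * uy) / (len_i * len_i)
        return six + t * ux, siy + t * uy

    psx, psy = project(sjx, sjy)           # p_s: projection of s_j
    pdx, pdy = project(djx, djy)           # p_d: projection of d_j

    # Equation 2: larger of |s_i p_s| and |p_d d_i|.
    d_par = max(math.hypot(psx - six, psy - siy),
                math.hypot(dix - pdx, diy - pdy))

    # Equation 3: second order Lehmer mean of the two offsets (assumed).
    l1 = math.hypot(sjx - psx, sjy - psy)
    l2 = math.hypot(djx - pdx, djy - pdy)
    d_perp = 0.0 if l1 + l2 == 0 else (l1 * l1 + l2 * l2) / (l1 + l2)

    # Equation 4: length of the longer Q-line times sin(theta).
    sin_theta = abs(ux * vy - uy * vx) / (len_i * len_j)
    d_theta = max(len_i, len_j) * sin_theta

    return w_perp * d_perp + w_par * d_par + w_theta * d_theta
```

With unit weights, two identical Q-lines have distance zero, and two parallel
Q-lines of equal length that are offset sideways by 3 units have distance 3,
which comes entirely from the perpendicular component.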

For clustering queries, similarities among Q-lines are measured using the
distance function. However, computing the similarities for every pair of Q-lines
is prohibitively expensive, especially for a large number of Q-lines. Moreover,
it is unnecessary to compute the similarity between two Q-lines if they are far
apart from each other. Thus, to decide which Q-lines should be compared with
each other, we introduce the concept of areas of influence. The areas of
influence help to prune a large number of Q-lines while forming clusters of
queries and thus reduce the computational overhead significantly.




Fig. 2. (a) Components of the distance functions, (b) Influence area of a Q-line 



Areas of influence: We define the areas of influence of a Q-line as a pair of
regions in which, if another Q-line is present, there is a high probability that
the distance between those two Q-lines is smaller than the threshold value
$\psi$. These regions are represented as two squares centered at the two
endpoints of the Q-line. Now, if another Q-line has its source and destination
inside these two squares, respectively, then we say the latter Q-line is in the
same region/area as the first one, and we compute the distance between these two
Q-lines to check whether they belong to the same cluster.

We define the side length of both squares as $2\lambda$. Thus the value of
$\lambda$ defines the size of the areas of influence. A larger value of
$\lambda$ results in many unwanted Q-lines being included in the distance
calculations, and a smaller value results in a large number of small clusters.
Since the appropriate value of $\lambda$ depends on the query set, we choose
$\lambda$ through an empirical study in the experiments. Figure 2(b) shows the
properties of the areas of influence. There are various other options for the
shape of the region, such as a rectangle, circle or ellipse. Our approach is
independent of the shape of the influence area; however, a proper distance
function may need to be developed for a chosen shape.
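A minimal Python sketch of the area-of-influence test is given below. It assumes
planar coordinates, axis-aligned squares of side $2\lambda$, and that points on
the square boundary count as inside; the function names are hypothetical.

```python
def in_square(p, center, lam):
    # True if p lies in the axis-aligned square of side 2*lam around center.
    return abs(p[0] - center[0]) <= lam and abs(p[1] - center[1]) <= lam


def within_area_of_influence(l, other, lam):
    # other's source must fall in the square around l's source, and
    # other's destination in the square around l's destination.
    return in_square(other[0], l[0], lam) and in_square(other[1], l[1], lam)


def subset_within_area_of_influence(lines, l, lam):
    # Candidate Q-lines that are worth comparing with l using the distance
    # function; all others are pruned without a distance computation.
    return [o for o in lines if within_area_of_influence(l, o, lam)]
```

Only the Q-lines returned by `subset_within_area_of_influence` need a full
distance computation, which is where the pruning benefit comes from.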

The concepts of distance function and areas of influence form the basis of
our algorithm Cluster_Queries.



Cluster queries: The input to Algorithm 2 is the set of Q-lines,
$L = \{l_1, l_2, \ldots, l_n\}$. We take each Q-line $l$ from $L$ and find the
set of Q-lines $I$ that are inside $l$'s areas of influence (Line 2.4). The
initial representative Q-line, $r$, of the set $I$ is computed in Line 2.5. The
representative Q-line of a given set of Q-lines is the average direction vector
of those Q-lines. The average direction vector $\overline{V}$ of the set
$V = \{v_1, v_2, \ldots, v_{|V|}\}$ is calculated with the following equation:
$\overline{V} = \frac{v_1 + v_2 + \cdots + v_{|V|}}{|V|}$, where $|V|$ is the
cardinality of $V$.

Now, for every element of $I$, we first calculate its distance to the current
representative Q-line. If the value is less than or equal to $\psi$, we add this
Q-line to the new cluster. The representative Q-line $r$ is then updated to
reflect the effect of the newly added Q-line (Lines 2.6-2.10). We use a moving
average process to update $r$. For example, if $r$ represents $n$ queries of a
cluster, then when a new Q-line $i$ is added to the cluster, the new value of
$r$ is $\frac{r \cdot n + i}{n + 1}$. We take the moving average because in some
cases the set $I$ may contain Q-lines that should have been clustered into two
different groups. However, as the initial representative Q-line is the average
of all Q-lines in the set, Q-lines of both probable clusters may have a distance
smaller than $\psi$. When $r$ is updated with the moving average of the
clustered Q-lines, the chance of $r$ shifting towards one cluster increases,
depending on the sequence of Q-lines that update $r$. In the worst case, if
Q-lines from both probable clusters update $r$ alternately, the improvement
might be very low. The Q-lines included in the new cluster are removed from
$L$.
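The averaging and moving-average update described above can be sketched in
Python as follows. Direction vectors are assumed to be 2-D tuples
(destination minus source), and the names `average_direction` and
`update_representative` are illustrative rather than taken from the paper.

```python
def average_direction(vectors):
    # Component-wise mean of a non-empty list of 2-D direction vectors.
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def update_representative(r, n, v):
    # Moving-average update: r currently represents n vectors and v is the
    # newly added one, so the result is (r * n + v) / (n + 1).
    return ((r[0] * n + v[0]) / (n + 1), (r[1] * n + v[1]) / (n + 1))
```

Updating the mean of two vectors with a third one gives the same result as
averaging all three directly, which is a quick sanity check for the update rule.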

Algorithm 2: Cluster_Queries(L)

Input : A set of Q-lines L = {l_1, l_2, ..., l_n}
Output: A set of clusters of queries C = {c_1, c_2, ..., c_m}

2.1   L' ← Null                                   // backup list
2.2   for each l ∈ L do
2.3       Q ← Null                                // list for current cluster
2.4       I ← subsetWithinAreaOfInfluence(L, l)
2.5       r ← representativeQ-line(I)
2.6       for each i ∈ I do
2.7           if distance(r, i) ≤ ψ then
2.8               Q ← Q ∪ {i}
2.9               Update(r)
2.10              Remove i from L
2.12      if size(Q) ≥ μ then
2.13          Create_Cluster(Q, r)
2.14      else
2.15          L' ← L' ∪ Q
2.17  for each q ∈ L' ∪ L do
2.18      for each c ∈ C do
2.19          if distance(representativeQ-line(c), q) ≤ ψ and q is not classified
2.20          then
2.21              c ← c ∪ {q}
2.22              Update(r)
2.23              Mark q as classified
2.24      if q is not classified then
2.25          Create_Cluster({q}, q)
2.26      Remove q from L' or L



However, if the distance value is greater than $\psi$, we simply discard that
Q-line. Also, a new cluster is only created if at least $\mu$ Q-lines are
present. If there are not enough Q-lines, then we put them in a backup list $L'$
and classify them later (Lines 2.12-2.15).

When the initial clustering process is completed, some Q-lines are not included
in any cluster. Some Q-lines are not clustered because their distance to the
representative Q-line was initially greater than $\psi$ and they later did not
fall into the influence areas of other Q-lines; these are the remaining elements
of the set $L$. The other set of Q-lines, $L'$, may be left unclustered because
there were not enough queries to form a new cluster (Line 2.15). For all these
queries, we first check them against the already created clusters. If we find a
cluster within a distance of $\psi$, we add the query to that cluster and update
its representative Q-line. Otherwise, a new cluster is created. In either case,
the Q-line is removed from $L$ or $L'$ (Lines 2.17-2.26).
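To make the two phases of the clustering concrete, here is a hedged Python
sketch of the procedure. It is an interpretation, not the authors' code: the
representative Q-line is assumed to be the component-wise mean of the member
endpoints (whose direction equals the average direction vector), the moving
average starts from the first accepted Q-line, and `endpoint_distance` is only a
stand-in for the paper's distance function. All names are hypothetical.

```python
import math


def endpoint_distance(a, b):
    # Stand-in distance: sum of source-to-source and dest-to-dest distances.
    return (math.hypot(a[0][0] - b[0][0], a[0][1] - b[0][1])
            + math.hypot(a[1][0] - b[1][0], a[1][1] - b[1][1]))


def _mean_line(lines):
    n = len(lines)
    return ((sum(l[0][0] for l in lines) / n, sum(l[0][1] for l in lines) / n),
            (sum(l[1][0] for l in lines) / n, sum(l[1][1] for l in lines) / n))


def _moving_average(r, n, l):
    # r represents n Q-lines; fold in one more Q-line l.
    return (((r[0][0] * n + l[0][0]) / (n + 1), (r[0][1] * n + l[0][1]) / (n + 1)),
            ((r[1][0] * n + l[1][0]) / (n + 1), (r[1][1] * n + l[1][1]) / (n + 1)))


def _in_area(l, other, lam):
    # Both endpoints of other lie in the squares around l's endpoints.
    return all(abs(other[e][a] - l[e][a]) <= lam for e in (0, 1) for a in (0, 1))


def cluster_queries(lines, distance, lam, psi, mu):
    """Return a list of (member Q-lines, representative Q-line) pairs."""
    in_L = [True] * len(lines)   # which Q-lines are still in L
    backup = []                  # indices of Q-lines moved to L'
    clusters = []
    # Phase 1: grow a candidate cluster around each remaining Q-line.
    for idx, l in enumerate(lines):
        if not in_L[idx]:
            continue
        cand = [k for k in range(len(lines))
                if in_L[k] and _in_area(l, lines[k], lam)]
        r = _mean_line([lines[k] for k in cand])
        members = []
        for k in cand:
            if distance(r, lines[k]) <= psi:
                r = _moving_average(r, len(members), lines[k])
                members.append(k)
                in_L[k] = False
        if len(members) >= mu:
            clusters.append(([lines[k] for k in members], r))
        else:
            backup.extend(members)
    # Phase 2: classify the backup list and whatever is still left in L.
    for k in backup + [k for k in range(len(lines)) if in_L[k]]:
        q = lines[k]
        for c, (mem, rep) in enumerate(clusters):
            if distance(rep, q) <= psi:
                clusters[c] = (mem + [q], _moving_average(rep, len(mem), q))
                break
        else:
            clusters.append(([q], q))   # no suitable cluster: start a new one
    return clusters
```

Passing the distance function as a parameter keeps the sketch independent of how
the three distance components are weighted.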



4.3 Path Calculation 

The final step of Algorithm 1 is to compute the shortest path for every path
query, which is done in two phases. The algorithm first computes the weighted
shortest path for each cluster between its source and destination regions, as
shown in Figure 3(a) (Line 1.6). Then the algorithm finds the approximate
shortest path for each individual path query in the cluster using the function
Construct_Path, as shown in Figure 3(b) (Lines 1.8-1.9).

The first phase, i.e., the region-to-region shortest path, is computed as
follows. As discussed earlier, a region pair consists of two MBRs, a source
region MBR and a destination region MBR. For a source region, the exit points
of the region are identified by considering its outgoing edges. For a
destination region, on the other hand, the entry points are identified by
considering its incoming edges. These regions act like virtual super-nodes,
where the exit (or entry) paths of the regions are the edges of those nodes.
We then apply a heuristic-based approach, the A* search algorithm, to find the
shortest path from a source region to a destination region.
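
One way to realize the super-node idea is to seed the A* frontier with every exit point at cost zero and to stop at the first entry point that is settled. The sketch below illustrates this under stated assumptions and is not the authors' implementation. The names `graph`, `coords`, `exit_points` and `entry_points` are hypothetical, and the straight-line heuristic assumes every edge weight is at least the Euclidean length of the edge.

```python
import heapq
import math

def region_to_region_path(graph, coords, exit_points, entry_points):
    """A* from a virtual source super-node to a virtual destination super-node.

    graph:  dict node -> list of (neighbor, edge_weight)
    coords: dict node -> (x, y), used by the straight-line heuristic
    """
    targets = set(entry_points)

    def h(node):
        # Straight-line distance to the closest entry point.
        x, y = coords[node]
        return min(math.hypot(x - coords[t][0], y - coords[t][1])
                   for t in targets)

    # Seeding every exit point with cost 0 is equivalent to zero-weight
    # edges from a virtual super-node to each exit point.
    best = {e: 0.0 for e in exit_points}
    parent = {e: None for e in exit_points}
    heap = [(h(e), e) for e in exit_points]
    heapq.heapify(heap)
    closed = set()
    while heap:
        _, node = heapq.heappop(heap)
        if node in closed:
            continue
        closed.add(node)
        if node in targets:
            # Walk the parent links back to the exit point that was used.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            path.reverse()
            return best[path[-1]], path
        for nxt, w in graph.get(node, []):
            cand = best[node] + w
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (cand + h(nxt), nxt))
    return math.inf, []

coords = {"a": (0, 0), "b": (0, 1), "m": (2, 0), "x": (4, 0), "y": (4, 1)}
graph = {"a": [("m", 2.0)], "b": [("m", 3.0)],
         "m": [("x", 2.0), ("y", 2.5)]}
result = region_to_region_path(graph, coords, ["a", "b"], ["x", "y"])
print(result)  # (4.0, ['a', 'm', 'x'])
```

The returned node list plays the role of the weighted shortest path sp_j that is later shared by all queries of the cluster.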





Fig. 3. (a) Finding the weighted shortest path, (b) Constructing shortest path by 
adding road segments within regions. 

The second phase, connecting the source and destination points to the
corresponding region-to-region shortest path, is computed as follows. We use a
simple Construct_Path procedure to find the road segments within a region. For
each query (s_i, d_i), two path fragments are computed: one in the source
region, from the source point s_i to the start point of the path sp_j, and
another in the destination region, from the end point of the path sp_j to the
destination point d_i. Here sp_j is the weighted shortest path between the
source and destination regions. Algorithm 3 summarizes the process.



Algorithm 3: Construct_Path(s_i, d_i, sp_j)

Input : A source s_i, a destination d_i, weighted shortest path sp_j
Output: A path p_i

3.1 f_1 ← shortest path between s_i and start point of sp_j
3.2 f_2 ← shortest path between d_i and end point of sp_j
3.3 p_i ← f_1 + sp_j + f_2
3.4 return p_i
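
A minimal Python sketch of Algorithm 3 follows. It is illustrative only. The helper `shortest_path` is a plain Dijkstra search chosen for brevity (the paper does not state which search is used inside a region), the adjacency-list format of `graph` is an assumption, and the second fragment is searched from the end point of sp_j towards d_i so that the three pieces can be concatenated in travel order.

```python
import heapq

def shortest_path(graph, start, goal):
    """Plain Dijkstra; returns the node list from start to goal, or []."""
    dist = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt, w in graph.get(node, []):
            if d + w < dist.get(nxt, float("inf")):
                dist[nxt] = d + w
                parent[nxt] = node
                heapq.heappush(heap, (d + w, nxt))
    return []

def construct_path(graph, s_i, d_i, sp_j):
    """Stitch a query's own fragments onto the shared region path sp_j.

    Assumes both fragments exist; an unreachable endpoint is not handled.
    """
    f1 = shortest_path(graph, s_i, sp_j[0])    # Line 3.1
    f2 = shortest_path(graph, sp_j[-1], d_i)   # Line 3.2
    # Line 3.3: drop the junction nodes that sp_j already contains.
    return f1[:-1] + sp_j + f2[1:]

graph = {"s": [("a", 1.0)], "a": [("m", 2.0)],
         "m": [("x", 2.0)], "x": [("d", 1.0)]}
p_i = construct_path(graph, "s", "d", ["a", "m", "x"])
print(p_i)  # ['s', 'a', 'm', 'x', 'd']
```

Because sp_j is computed once per cluster, only the two short fragments are searched per query, which is where the shared execution saves work.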



4.4 Discussion 

So far we have assumed that there are n submitted queries in the system, and
we apply our clustering technique to group these n queries into different
subgroups. The value of n can be determined by a user-given threshold, i.e.,
how long a user can wait for the query answer. Further optimizations, such as
reusing the clusters for future incoming queries, are left for future work.

Lastly, we compute the shortest (or fastest) path between a source region
and its destination region. Every query in that source region uses the same
path to reach its destination. This may cause traffic congestion on that
route. Also, that particular path might not be optimal for all the queries of
that region. To overcome this problem, an alternative approach is to calculate
k shortest paths instead of only one path for each source-destination pair.
Each individual query in that region can then use the path with the minimum
travel time. This will result in more accurate paths, but the overall
processing time will be slightly higher than that of GBPQ. A detailed study of
using k shortest paths to evaluate group based path queries is left for our
future research.

5 Experimental Study 

In this section we evaluate the performance of our proposed algorithm by
varying a wide range of parameters. We compare our group based path queries
(GBPQ) approach with the naive approach that executes each query individually
using the A* algorithm [551. We run our experiments on a system with an Intel
Core i5 2.67 GHz processor and 4 GB of memory running Windows 7 Ultimate. The
algorithms are implemented in C++.

A road network dataset of North America with 175,813 nodes, 179,179 edges,
and a diameter of 18,579 units is used. At the beginning of the experiments,
the entire map data is loaded into memory. For a single path query, we need to
select a source-destination pair on the map. However, to simulate group
behavior, we first partition the entire data space into a number of square
windows. Then, we choose two random windows, one as the source region and the
other as the destination region, for a group of path queries whose source and
destination locations lie inside the source and destination regions,
respectively. Within a region, query points are generated using two
distributions, the Gaussian distribution and the Zipf distribution. The effect
of the distribution increases with the size of the windows. For example, when
the window size is 10000×10000, there is no partition, as a single window
covers the whole data space. In such a case, all the queries are distributed
over the whole map according to either the Gaussian or the Zipf distribution.
Since we consider Euclidean space while generating queries, i.e.,
source-destination pairs, we map each point location (source/destination) to
the nearest node of the road network.
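
The final mapping step can be sketched as below. This is an illustration only: `node_coords` and the node names are hypothetical, and a linear scan is used for clarity, whereas a network with 175,813 nodes would more plausibly be served by a spatial index.

```python
import math

def nearest_node(point, node_coords):
    # Return the id of the network node closest to a Euclidean point.
    return min(node_coords, key=lambda n: math.dist(point, node_coords[n]))

def snap_query(source, destination, node_coords):
    # Map both endpoints of a generated query onto the road network.
    return (nearest_node(source, node_coords),
            nearest_node(destination, node_coords))

node_coords = {"n1": (0.0, 0.0), "n2": (10.0, 0.0), "n3": (0.0, 10.0)}
query = snap_query((1.0, 2.0), (2.0, 9.0), node_coords)
print(query)  # ('n1', 'n3')
```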

Table 2. List of parameters

Description and Symbol                    Range         Step Size   Default Value
Number of queries n (×1000)               10 - 100      10          50
Window length ω                           100 - 10000               100
Minimum query distance coefficient d_c    0.1 - 1.0     0.1         0.4



5.1 Parameter Tuning 

We use three pruning parameters in our experiments: i) the half length of an
area of influence A, ii) the minimum distance threshold ψ, and iii) the
minimum number of queries μ. The half length of an area of influence defines
two surrounding regions around the source and destination points. Queries in
the same areas of influence have a higher probability of belonging to the same
cluster. Moreover, as per our distance measure, the maximum allowable distance
between two Q-lines increases with A. The performance of GBPQ also depends on
the distance threshold ψ. The parameters ψ and A are correlated, as a higher
value of ψ allows us to take a larger A: A is used to select the initial
subset, and ψ then determines which of these queries are used to create a
cluster. The other parameter, μ, determines the minimum cluster size and thus
has an impact on the number of clusters. Choosing effective values for these
parameters is a challenge, so we resort to a detailed experimental study to
choose appropriate values.

For a sample set of 50000 queries (1000 clusters × 50 queries each), we
measure the processing time, the number of clusters, and the number of
initially unclustered queries. The findings of these experiments are discussed
below.

Effect of half length of an influence area A and distance threshold ψ:

We vary A from 70 to 130 with a step size of 10 units. Thus, for this range of
A values, the side length of the influence region ranges from 140 to 260. In
this set of experiments, our queries are generated in clusters with a window
size of 100 units. Considering the size of the clusters we created, we can
estimate the expected range of A and ψ. We created the clusters in windows of
100×100 units, so to inscribe such a square in another square we need, in the
worst case, a square with a side of 100√2 units. Our expected 2A should be
close to this value. However, as it is not possible to determine the centers
of the windows or the locations of the partitions, we may need a slightly
larger 2A. For ψ, consider queries that share one endpoint and whose other
endpoints lie one at the center and the other at a corner of the respective
window. In this case, the distances between queries may range from 100
(50 + 50) to 170 (50 + 50 + 50√2), which gives an initial intuition about the
range. We observe the number of clusters created for different values of
these parameters.
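
The back-of-the-envelope estimates above can be checked numerically. The snippet only evaluates the expressions quoted in the text; the variable names are ours.

```python
import math

window = 100.0                          # side of a query window, in map units
bounding_side = window * math.sqrt(2)   # diagonal of the window, 100*sqrt(2)
half_length = bounding_side / 2         # corresponding estimate of A

psi_low = 50 + 50                       # lower end quoted in the text
psi_high = 50 + 50 + 50 * math.sqrt(2)  # upper end quoted in the text

print(round(bounding_side, 1), round(half_length, 1))  # 141.4 70.7
print(psi_low, round(psi_high, 1))                     # 100 170.7
```

The estimate of about 70.7 for A is consistent with the tested range starting at A = 70, and the estimated interval for ψ covers the values 140 and 160 examined in the experiments.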

Figures 4(a), 4(b), 4(c), and 4(d) show the number of clusters created by
varying the values of A and ψ when μ is fixed at 1, 10, 20, and 30,
respectively. The time required for clustering is 6 seconds on average.

Figure 4 shows that for higher values of A the number of clusters decreases.
The reason is that when we select a larger influence area, more queries fall
in the same area, which makes the average number of queries per region greater
than that of a smaller influence area. For a fixed A, when we increase the
value of ψ, the number of clusters decreases.

Fig. 4. Effect of A and ψ when at least (a) 1, (b) 10, (c) 20, (d) 30 queries
are required to form a new cluster.

We also compute the number of initially unclustered queries. Here, queries
whose Q-lines are not clustered in the first phase of Algorithm 2 (Lines
2.6-2.15) are called initially unclustered queries. The number of initially
unclustered queries increases when the value of μ increases, and it decreases
slightly with increasing values of A. For example, when A = 80 and ψ = 160,
the numbers of unclustered queries are 0, 1416, 3428, 5667, and 9267 for μ
values of 0, 10, 20, 30, and 40, respectively. For A = 130 and ψ = 160,
however, the numbers of unclustered queries are 107, 1150, 2922, 4243, and
7758, respectively. When the value of A is small, there are fewer unwanted
queries (that should belong to another cluster) in a cluster, but an increase
in the value of μ leaves some queries initially unclustered.

With the increase of A, the probability of fulfilling the constraint on the
minimum number of queries μ increases. However, with the increase of A the
number of unwanted queries in a cluster also increases. Thus, the value of A
should be chosen neither too high nor too low. The experimental results
(Figure 4) validate this claim. We can see from the figure that 80 should be
chosen as the default value of A, as it gives consistent performance while the
other parameters vary.

From Figure 4 we can see that the graphs are more stable for μ values of 20
and 30. In these cases, for ψ values of 140 and 160, the variation due to
different values of A is insignificant. Since the number of clusters for
ψ = 160 is smaller than that for ψ = 140, we choose 160 as our default value
for ψ. Also, since for different values of A and ψ the range of μ between 20
and 30 gives the best performance, we can choose any value in this range as
the default value of μ.



Effect of minimum number of queries μ: Figure 5 shows the effect of the
minimum number of queries required to form a new cluster for A = 80. We see
that for the smallest value of μ the number of clusters is very high. It
decreases as the value of μ increases and remains almost constant when we
choose values close to half the number of queries generated in each cluster.
For the experiments we generate clusters with 50 queries each. Thus, we can
see that for μ = 20 and μ = 30 the difference in the number of clusters is
insignificant. If we further increase the value of μ to 40, the time required
for clustering increases. In this case most of the queries are not classified
in the first phase, so more time is required to classify them. Also, a large
number of clusters containing very few queries are created in the second
phase of Algorithm 2 (Lines 2.17-2.26).

Fig. 5. Effect of μ for A = 80.



We have shown the number of unclustered queries for different values of μ.
Around 20% of the queries remain unclustered when μ = 40. For μ values of 20
and 30, approximately 10% of the Q-lines remain unclustered. Again, Figure 4
shows that for μ = 30 the effect of A and ψ is minimal. When ψ equals 160, the
number of clusters is lower. Considering all these performance issues, we
choose A = 80, ψ = 160, and μ = 30 as the default values for our performance
evaluation experiments.

Fig. 6. Effect of number of queries (a-b) Gaussian and (c-d) Zipf distributions.



5.2 Performance Evaluation 



In this section, we show the performance of our algorithm based on two
performance metrics: the processing time and the average percentage of
deviation of the answer from the actual shortest path. The processing time is
the total query response time, including the clustering time, for a given
number of queries. The clustering time is about 6 seconds on average for 50000
queries and shows a linear relationship with the number of queries for the
selected parameter values. The average deviation percentage is calculated with
the formula ((d_tr - d_ta) / d_ta) × 100%, where d_tr is the total distance
returned and d_ta is the actual total shortest path distance for the given
number of queries. We vary the number of queries n, the minimum query
distance, i.e., the distance between the source point and the destination
point of each query, and the window size ω, while comparing our approach with
the naive approach. We determine the minimum query distance of a query by
multiplying d_c by 100√ω. We calculate the minimum query distance as a
function of ω because it allows us to have queries within the same window when
the window size is larger and to span the queries over several windows when
the window size is smaller. The range, step size, and default values of these
parameters are listed in Table 2.
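
The deviation metric can be sketched as follows, assuming it is the relative excess of the total returned distance d_tr over the total actual shortest path distance d_ta. The function name and the sample distances are hypothetical and serve only to show the computation.

```python
def average_deviation_percent(returned, actual):
    # d_tr: total length of the paths returned for all queries.
    d_tr = sum(returned)
    # d_ta: total length of the true shortest paths for the same queries.
    d_ta = sum(actual)
    return 100.0 * (d_tr - d_ta) / d_ta

# Made-up path lengths for two queries, in map units.
deviation = average_deviation_percent([1005.0, 2010.0], [1000.0, 2000.0])
print(deviation)  # 0.5
```

Because the totals are summed before dividing, long queries weigh more in this metric than short ones.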

Effect of number of queries: We vary the number of queries n in the range
of 10000 to 100000 with a step size of 10000. For both the Gaussian and Zipf
distributions, we see that the processing time of GBPQ rises slightly with
increasing n (Figures 6(a) and 6(c)), whereas the processing time of the naive
approach increases significantly. When n is 10000, the processing times of
GBPQ and the naive approach are approximately 100s and 1000s, respectively.
With n increased to 100000, the processing times of GBPQ and the naive
approach are approximately 1000s and 13000s, respectively. Thus, our algorithm
outperforms the naive approach by a greater margin for larger values of n. On
average, GBPQ is twelve times faster than the naive approach. Moreover, our
experimental results show that, on average, the deviation of the answer path
returned by GBPQ from the actual shortest path is only around 0.4% for the
Gaussian distribution (Figure 6(b)) and around 0.5% for the Zipf distribution
(Figure 6(d)).
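
The approximate timings quoted above translate directly into speedup factors. The snippet uses only those two reported data points; the helper name `speedup` is ours.

```python
def speedup(naive_seconds, gbpq_seconds):
    # How many times faster GBPQ is than the naive approach.
    return naive_seconds / gbpq_seconds

# n -> (naive time, GBPQ time), approximate values quoted in the text.
timings = {10000: (1000.0, 100.0), 100000: (13000.0, 1000.0)}
factors = {n: speedup(naive, gbpq) for n, (naive, gbpq) in timings.items()}
print(factors)  # {10000: 10.0, 100000: 13.0}
```

The two endpoint ratios illustrate the growing margin. The reported twelve-fold average is taken over all tested values of n and cannot be recomputed from these two points alone.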

Effect of query distance: In this set of experiments, we compare our approach
with the naive approach by varying the minimum query distance. For this, we
vary the minimum query distance coefficient d_c used to generate the queries
and measure the processing time of GBPQ and of the naive approach.
Figure 7(a) shows the results for the Gaussian distribution of query points.
We see from the figure that the processing time of GBPQ increases slowly with
d_c. On the other hand, the processing time of the naive approach increases
significantly as d_c increases. This is because a higher value of d_c
corresponds to a longer distance between the source and the destination, and
the number of nodes traversed for such a query is higher than for a query with
a smaller d_c. Thus, the processing time increases with d_c. The accuracy of
GBPQ increases with d_c: the percentage of deviation of the answer path from
the actual path falls from 0.5% to 0.1% when the query distance increases from
1000 units to 10000 units (Figure 7(b)).

The results for the Zipf query distribution (not shown) show behavior similar
to that of the Gaussian distribution.

Fig. 7. Effect of the minimum query distance: (a) processing time, (b)
percentage of deviation. Effect of window size: (c) processing time, (d)
percentage of deviation.

Effect of window size: In this set of experiments, we vary the window length
ω and compare the performance of our approach with that of the naive approach.
Figure 7(c) shows the processing time of GBPQ and the naive approach for the
Gaussian distribution of query points. We see that the processing time
increases with ω. For example, when there is no partition, i.e., the window
length is equal to the length of the data space, the processing time of GBPQ
is 62.3% lower than that of the naive approach, whereas for windows of size
100×100 the processing time of GBPQ is 92% lower than that of the naive
approach.
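
A statement of the form "X% lower processing time" can be converted to a speedup factor as 1 / (1 - X/100). The short sketch below applies this to the two figures quoted above; the function name is ours.

```python
def speedup_from_reduction(percent_lower):
    # If GBPQ takes (100 - X)% of the naive time, it is 1 / (1 - X/100)
    # times faster.
    return 1.0 / (1.0 - percent_lower / 100.0)

no_partition = speedup_from_reduction(62.3)   # about 2.65
small_windows = speedup_from_reduction(92.0)  # about 12.5
print(round(no_partition, 2), round(small_windows, 2))
```

Under this reading, GBPQ is roughly 2.65 times faster when a single window covers the whole data space and roughly 12.5 times faster for 100×100 windows.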

Figure 7(d) shows the deviation of GBPQ's answers for different values of ω.
We find that the deviation stays between 0.29% and 0.56%. When the value of ω
is low, the number of clusters created is high, which results in a lower
deviation. For a higher value of ω, we see a slightly higher deviation than
for a lower ω, as the number of clusters created is low in this case.

The results for the Zipf query distribution (not shown) show behavior similar
to that of the Gaussian distribution.



6 Conclusion 



In this paper, we have proposed a group based approach for processing a large
number of simultaneous path queries in a road network. Our approach is based
on a novel clustering technique that groups queries based on the similarities
of their Q-lines. We introduce two concepts, the distance function and the
areas of influence, that help us to effectively cluster similar queries and
execute them as a group. Our group based approach to evaluating a large number
of simultaneous path queries provides a cost-effective solution with reduced
computational overhead.

Extensive experimental studies show the efficiency and validate the
effectiveness of our proposed algorithm. The group based heuristic in our
approach reduces the computational overhead significantly and thereby answers
a large number of simultaneous queries in real time. Our experiments have
shown that, on average, our shared execution approach is ten times faster than
a traditional approach in which each path query is evaluated individually. Our
approach achieves this speedup at the cost of sacrificing accuracy by a
negligible 0.5% in the worst case.
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